Comprehensive and realistic simulation of tumour genomic sequencing data

Abstract Accurate identification of somatic mutations and allele frequencies in cancer has critical research and clinical applications. Several computational tools have been developed for this purpose but, in the absence of comprehensive ‘ground truth’ data, assessing the accuracy of these methods is challenging. We created a computational framework to simulate tumour and matched normal sequencing data for which the source of all loci that contain non-reference bases is known, based on a phased, personalized genome. Unlike existing methods, we account for sampling errors inherent in the sequencing process. Using this framework, we assess accuracy and biases in inferred mutations and their frequencies in an established somatic mutation calling pipeline. We demonstrate bias in existing methods of mutant allele frequency estimation and show, for the first time, the observed mutation frequency spectrum corresponding to a theoretical model of tumour evolution. We highlight the impact of quality filters on detection sensitivity of clinically actionable variants and provide definitive assessment of false positive and false negative mutation calls. Our simulation framework provides an improved means to assess the accuracy of somatic mutation calling pipelines and a detailed picture of the effects of technical parameters and experimental factors on somatic mutation calling in cancer samples.


INTRODUCTION
The identification of somatic mutations from highthroughput sequencing (HTS) data plays a critical role in scientific r esear ch and clinical oncology.Cancer dri v er mutations continue to inform prognosis, guide therapy and shape our understanding of how cancer de v elops and e volv es.Experimental design and analytical decisions, such as sequencing depth ( 1 ), target ( 2 ) and the choice of bioinformatics pipelines (3)(4)(5)(6)(7)(8), all influence the power and accuracy of somatic mutation detection.Assessing their effects on the r ecover ed soma tic muta tion landscape r equir es HTS r efer ence data containing a 'ground truth' set of somatic mutations for which the location and source of all loci that contain non-r efer ence bases (a base call that does not match the r efer ence genome at that locus) are known.Such data are a key component in the validation and benchmarking of mutation calling pipelines.The accuracy of the results returned by somatic mutation calling pipelines is critical for many r esear ch and clinical applications; howe v er, studies benchmar king these pipelines often publish inconsistent results (9)(10)(11)(12), indicative of the many challenges faced in this area.The variant allele frequency (VAF) spectrum across all somatic mutations is also relevant for understanding cancer origin and evolution ( 13 ) and the de v elopment of tr eatment r esistance ( 14 ), as well as for inferring clinically important metrics such as tumour mutation burden (TMB), purity and ploidy ( 15 ).Significantly, despite the clinical and r esear ch r elevance of individual mutation frequencies and the VAF spectrum as a whole, no studies to date have attempted to assess the accuracy with which the frequencies of somatic variants ar e inferr ed by mutation callers, and this r epr esents a significant knowledge gap.
A number of computational tools have been de v eloped to pr ovide gr ound truth data to assess the accuracy of somatic mutation callers.Sequencing read simulators, such as ART ( 16 ), can be used to generate reads from a r efer ence genome.These are then modified to introduce 'somatic' variants (a process known as spiking in variants) using software such as BAMSurgeon ( 17 ).Although convenient, a r efer ence-based approach does not reflect the di v erse sources of variation within real sequencing data.Increasingly, this is being addressed through a 'hybrid' solution employing both real and synthetic sequences ( 17 ).Real sequencing data are subsampled into two sets, corresponding to a virtual matched tumour and normal pair, and somatic variants are then spiked into the tumour reads.Howe v er, in addition to the variants that are purposely spiked in, these data also contain low-frequency somatic variants present in the source sample from which they were deri v ed, as well as sequencing errors and other artefacts.The precise origin of this additional variation is likely to be unknown, creating difficulties for the evaluation of mutation callers.
The computational tools that are currently used to spike in soma tic muta tions also have significant limita tions.The number of reads that are edited to introduce the mutant allele at the r equir ed locus is typically determined by the product of specified mutation frequency and the sequencing depth at the site.This fails to take account of stochastic aspects of the sequencing process.In reality, sequence r eads ar e a random sample of the DNA at a locus and the observed number of reads containing the alternate allele is, ther efor e, a random variable.Failur e to take this into account can result in spike-in bias, where a variant allele is always spiked in if the product of the VAF and the sequencing depth is > 1 and ne v er otherwise.For e xample, a site designated to contain a somatic mutation with a frequency of 10%, which is sequenced to a depth of 30 reads, will always contain exactly 3 reads with the alternate allele if the mutation is spiked in with the widely used tool, BAMSurgeon ( 17 ).Howe v er, the number of sequence reads containing the alternate allele for a somatic mutation of this frequency may be > 3 or < 3 in real data.
Here, we describe a comprehensi v e and stochastic tool for simulating tumour sequencing data that, unlike existing methods, enables precise determination of the accuracy and power to detect a somatic mutation as a function of its actual frequency within the cancer sample.Analysis of simula ted da ta genera ted using this frame wor k pro vides no vel insights into the relationship between the true somatic mutation frequency spectrum and the empirical frequency spectrum obtained following application of a mutation caller.Using our simulations, we assess mutation caller bias in VAF estimates and demonstrate the empirical somatic mutation frequency distribution corresponding to somatic mu-tations deri v ed from a neutral model of tumour evolution.We also perform a comprehensi v e assessment of false positi v e and false negati v e soma tic muta tion calls, made possible by the fact that our simulation tool provides the source of all non-r efer ence alleles in the dataset.

MATERIALS AND METHODS
A personalized r efer ence genome containing all germline single-nucleotide variants (SNVs) and indels annotated for 1000 Genomes donor HG00110 (female of English and Scottish ancestry) was created.This was used as a base to simulate normal and pre-tumour (i.e.before the somatic variants have been spiked in) genomic sequencing data using the ART read simulator configured with default empirical err or pr ofile and corr esponding to differ ent depths of coverage, 100 ×, 200 ×, 350 × and 600 ×.All reads in the SAM output generated by ART are aligned to their true location within the personalized phased genome from which they were simulated.The target simulated was an exome capture consisting of all hg38 exons plus an additional 100 base pairs at the 3 and 5 ends of each capture range.These data were then used as a base to spike in the r equir ed somatic distribution of SNVs.Once the spike-in process was completed, the phased data (a BAM pair corresponding to the ma ternal / pa ternal haplotype set) were merged and realigned against a standard r efer ence (Figur e 1 ).Realignment was performed using BWA (v0.7.17) ( 18 ) with the hg38 r efer ence genome.Somatic variant calling was performed using Mutect2 according to GATK (4.2.2.0) Best Practices for somatic short variant discovery ( 19 ).Crosssample contamination and base quality scor e r ecalibration stages were not run as the data were simulated without contamina tion or systema tic biases.Mutect2 was called with argument set to ensure all filtered variants were recorded in the variant call format (VCF) file ( tumor-lod-toemit = 0 ).The stochastic simulation frame wor k is written in C, on the HTSlib 1.13 API ( 20 ).

Simulation of mutation allele frequency spectrum
We used our simulation frame wor k to simulate somatic mutations with defined frequencies.These simulations included the full mutant allele frequency spectrum of a diploid tumour expected under a neutral evolutionary model ( 13 ).We also simulated a low-frequency, high-burden point mass at a fixed frequency and finally a uniform distribution of soma tic muta tion frequencies to investiga te caller detection rate and inferred allele frequency as a function of true allele frequency.The first two simulations were repeated across different depths of coverage (100 ×, 200 ×, 350 × and 600 ×) to explore the effect of depth on soma tic muta tion detection and inferred allele frequency.The uniform distribution was simulated at a fixed depth of 100 ×.Finally, to illustrate the distinction between somatic variant simulation using BAMSurgeon and our simulation frame wor k, the point-mass simulations were repeated using BAMSurgeon to spike in the r equir ed burden.
Complete allele spectrum simulation of a diploid tumour derived from a neutral e v olutionary model.A total b urden of 2681 somatic variants within a true frequency range of 0.01-0.25 was simulated, with a VAF spectrum for subclonal mutations corresponding to the neutral model of evolution ( 13 ) (Figure 2 A) (i.e.cumulati v e distribution function of the mutation frequency, f , proportional to 1 / f ).The clonal mutations were simulated with a fixed frequency of 0.5 (the simulation had 100% tumour purity).Mutations were spiked into each of the four pre-tumour phased BAM pairs (with depths of 100 ×, 200 ×, 350 × and 600 ×).Mutect2 was then run on the resulting matched tumour-normal pairs.The resulting variant output (VCF) was compared to the ground truth values.

Lo w-fr equency, high-burden point-mass somatic distribution.
A low-frequency burden of 10 000 somatic variants, all with a true allele frequency of 0.035, was spiked into each of the four pre-tumour phased BAM pairs (100 ×, 200 ×, 350 × and 600 ×) using the stochastic simulation frame wor k.Mu-tect2 was then run on the matched tumour-normal pairs.The resulting variant output (VCF) was correlated against its ground truth values (again using the simulation framewor k).The observ ed VAF spectrum was plotted showing the dispersion of allele frequencies about their ground truth at each of the four sequencing depths.
Uniform somatic distribution.We divided the first half of the allele frequency range (i.e.frequencies from 0 to 0.5) into 100 semi-centile bins.Into each bin, we spiked in a uniform distribution of 10 000 somatic variants at loci randomly distributed across the target region in 100 × pre-tumour HTS data.For each semi-centile, we recorded the percentage of the ground truth burden that was passed (considered true somatic) or filtered (considered artefactual) and its associated allele frequencies, as inferred by Mutect2.From this, we created a matrix detailing the detection rate for each semi-centile and the regions of the spectrum in which that burden was detected, as annotated by Mutect2.This enables us to predict how the caller performs in detecting a true burden and identifying its associated allele frequencies.The reasons candidate somatic variants were filtered by the caller in each semi-centile were also recorded.The simulations were carried out in groups of four semicentiles per simulation, with each simulation containing a total burden of 40 000 somatic variants, yielding 25, 100 ×,

Simulations of FFPE and 8-o xoG ar tef acts .
We performed additional simulations that included artefacts that are typical of formalin-fixed, paraffin-embedded (FFPE) samples and oxidati v e DN A damage.To sim ulate artefacts associated with FFPE samples, we downloaded high-coverage (minimum depth ≥500) colorectal variant call data ( 21 ) from three patients, each containing two samples, one fresh frozen (FF) and the other FFPE (48 h fixation time), both of which had been resected from the same tumour.Using variants identified only in the FFPE sample, we estimated the FFPE single-base substitution (SBS) signature (based on the conventional 96 triplet mutation types) associated with these data and created a context-specific FFPE distribution of simulated artefacts.We then created a second distribution based on COSMIC signature SBS45 ( 22 ) to simulate oxidati v e damage during sample preparation.Both distributions were combined with an empirical set of DNA damage artefact allele frequencies ( 21 ).To preserve the required orientation, the 100 × pre-tumour BAMs were each split into two separate files, one with reads from inserts that aligned to the forward genomic strand and the other containing the remaining reads.The target burden, totalling 7 332 528 DNA damage artefacts, was then divided evenly between forward and re v erse strand alignment BAM files and spiked in using the stochastic simulation frame wor k.All files were merged back into the final tumour BAM on completion and realigned against the hg38 r efer ence befor e being subjected to variant calling, using Mutect2 in tumour-normal mode.These simulations did not include any somatic mutations and, consequently, any variants identified by the caller were false positi v es.

RESULTS
The simulation frame wor k de v eloped her e, which we r efer to as stochasticSim, has significantly enhanced functionality compared to existing methods (Table 1 ).A key feature is the fully comprehensi v e truth set.Truth sets based on data deri v ed from controlled mixtures of distinct samples (usually cell lines) or by spiking in mutations into a single sample contain germline variants, sequencing errors and alignment errors, among other artefacts.They also contain true soma tic muta tions present, usually a t low frequencies, in the original samples.As a consequence, these simulation methods do not provide an accurate and complete set of the true soma tic muta tions in the sample ( 24 , 25 ).This is r equir ed, f or example, f or an accurate assessment of the perf ormance of methods to detect soma tic muta tions.In contrast, a comprehensi v e truth set not only allows us to identify true and false positi v es definiti v ely, but also enab les us to identify the cause of all false positi v e calls.
Our simulation frame wor k provides a complete record of the source of all non-reference bases in the data.This allows us to assess the sources of all false positi v e and false negati v e mutation calls.In the first simulation, a total of 2681 somatic mutations were simulated across a range of fr equencies corr esponding to a neutral model of tumour evolution ( 13 ) (Figure 2 A), with sequencing depths ranging from 100 × to 600 ×.The false positi v e rate was extremely low, with just one false positi v e (due to sequencing error) being detected over four simulations at different sequencing depths.Over all detection r a tes a t each of the four sequencing depths (100 ×, 200 ×, 350 × and 600 ×) were 22%, 28%, 32% and 36%, respecti v ely.The relati v ely low proportion of mutations detected in this simulation reflects the large proportion of low-frequency variants implied by the neutral model of tumour evolution ( ∼75% of the mutations had a true allele frequency < 0.05).As the number of reads carrying the alternati v e allele is a random variab le, a number of mutant alleles were not found at all in the simulated data and ther efor e would not be detected by any mutation calling pipeline.At each of the four sequencing depths (100 ×, 200 ×, 350 × and 600 ×), 24%, 13%, 8% and 4%, respecti v ely, of the true somatic burden received no coverage of the alternati v e allele at the variant locus.The majority of the remaining true somatic variants were present at too low frequencies to be considered by Mutect2 for filtering and were dropped without leaving any record in the VCF file (Figure 2 ).
A small proportion of all somatic variants ( ∼2% at 600 ×), most of which were also of low allelic frequency, were missed as the reads supporting the alternati v e allele wer e incorr ectly aligned against the r efer ence genome.A substantial number of the true somatic mutations were removed by mutation caller filters that incorrectly identified them as artefacts (Figure 2 B).The contribution of different filters varied across sequencing depths.Interestingly, the number of true soma tic muta tions tha t failed these filters increased with increasing sequencing depth.This was mainly due to the alternati v e allele being incorrectly flagged as a germline or other artefact common to both the tumour and normal samples ( normal artifact ).As sequencing depth increases, so too does the probability of a read error in the normal sample occurring at the same locus and with the same base as a true somatic variant in the tumour, thereby increasing the number of variants filtered in this way.

Probability of somatic mutation detection as a function of mutation frequency
To investigate the relationship between the probability of detecting a somatic mutation and its frequency in the tumour sample, we simulated somatic mutations distributed uniformly over a range of frequencies from 0 to 0.5.The true number of somatic variants together with the total number detected by the caller in each semi-centile was recorded, allowing us to assess the sensitivity to detect somatic variants over a range of allele frequencies.As expected, the detection rate (defined as the probability of a true somatic mutation being passed by the caller) was a strong function of the simulated variant frequency (Figure 2 C).The normal artifact filter accounted for ∼34% of true somatic variants incorrectly filtered by Mutect2 with a significant contribution at lower frequencies ( < 0.2) (Figure 2 ).The number of true somatic mutations removed by the clustered events filter increased with increasing frequency (Figure 2 D).This number was relati v ely high in these simulations due to the high burden of mutations simulated in each frequency band (40 000 somatic variants, randomly distributed across a 76 Mb target region).This high burden increased the probability of multiple somatic mutations being spiked in in close proximity, resulting in them being flagged by this filter.A small number of true somatic mutations wer e filter ed due to strand bias, but this was not noticeable beyond an allele frequency of 0.1 and it should be noted that no strand-specific artefacts were simulated.

Empirical mutation frequency spectrum corresponding to a neutral model of tumour evolution
Information on how tumours e volv e is relevant for gaining a better understanding of cancer origin, the de v elopment of immune evasion and resistance to treatment.Models of tumour e volution hav e implications for the frequency spectrum of somatic mutations observed in a cancer sample; howe v er, the relationship between a theoretical frequency spectrum and the empirical spectrum that is observed when mutations are called using existing computational pipelines is unclear, particularly in the case of moderate sequencing depth.We assessed the mutation frequency spectrum recovered by the caller for the simulations corresponding to the neutral model of evolution ( 13 ) over a range of sequencing depths (100 ×, 200 ×, 350 × and 600 ×).These empirical distributions differ qualitati v ely for different sequencing depths, with lower depth simulations having a much higher proportion of muta tions a t intermedia te frequencies than predicted by the neutral model of tumour evolution (Figure 3 ).As expected, the observed frequency spectrum r esembled mor e closely the expected form (with a cumulati v e distribution function proportional to the reciprocal of the frequency) at the higher sequencing depths.The inferr ed fr equencies of somatic mutations are also relevant for the calculation of TMB, which is typically defined as the number of somatic SNVs per megabase (mut / Mb) with an inferr ed fr equency ≥0.05.The ground truth TMB for the diploid tumour in this simulation was 8.5 mut / Mb (equivalent to 645 variants with a true frequency ≥0.05, at 100% tumour purity for a 76 Mb target).TMB was estimated at each of the four sequencing depths (100 ×, 200 ×, 350 × and 600 ×) as 6.99, 7.70, 7.84 and 7.84 mut / Mb, respecti v ely, with a 12% increase in estimated TMB between 100 × and 600 ×.

Misestimation of mutation frequencies
To illustrate the impact of stochastic effects on the estimation of somatic mutation frequencies, we simulated 10 000 soma tic muta tions a t a fix ed fr equency of 0.035.The detection rate (i.e. percentage of the somatic mutations annotated as PASS by Mutect2) at sequencing depths of 100 ×, 200 ×, 350 × and 600 × was 20%, 33%, 45% and 54%, respecti v ely, with v ery small numbers of false positi v e somatic muta tions a t each depth (6, 2, 1 and 3, respecti v ely, with one of the false positi v es resulting from incorrect read alignment and the remainder from sequencing error).The mean inferr ed fr equencies r eturned by the caller wer e 0.065, 0.050, 0.044 and 0.040, illustrating an upward bias (relati v e to the true frequency of 0.035; Figure 4 A), which decreases with increasing sequencing depth.The bias results from the relationship between the probability of detecting a somatic mutation and the number of reads containing the mutation.As we move to the right of the spectrum (Figure 4 B), the fraction of the variants r ecover ed (ratio of the height of the pass to ground truth histograms) increases.This results in an allele frequency distribution for pass variants with a mode that is shifted to the right, relati v e to the ground truth distribution.We have identified similar biases in VAFs previously, in the context of the use of a mutation frequency threshold in the calculation of TMB ( 27 ).Howe v er, as seen in these simulations, a threshold is not r equir ed to observe a bias in the inferred variant frequencies (which are estimated from the read fractions).We also used these simulations to compare our stochastic simulation toolkit with read fraction methods of simula ting HTS da ta by repea ting the simulation using BAMSurgeon ( 17 ) to spike in the required distribution.A similar total burden was detected by Mutect2 from both simulations (BAMSurgeon totals: 2130, 3016, 4376 and 5303; stochastic simulation totals: 2008, 3349, 4473 and 5398); howe v er, ther e wer e substantial differences in the VAF spectrum associated between the two cases (Supplementary Figure S2).

Simulations of FFPE and 8-oxoG artefacts
FFPE is a method of tissue preservation enabling samples to be stored at room temperature almost indefinitely ( 28 ).The procedure also creates asymmetric DNA damage such as deamination of cytosine to uracil (resulting in the detection of C > T transitions) ( 21 ).Similarly, oxidati v e DNA damage introduced during sample preparation, for example, as a by-product of acoustic shearing ( 29 ), generates 8-o x oguanine (8-o x oG) leading to G > T transversions.Both types of DN A damage usuall y manifest as a high burden of low-frequency artefacts in sequencer output.Only a  made it through all filters r epr esented only a very small proportion of the original number of sites at which artefacts were simulated, this number could have a substantial impact on results in many studies.A large-scale empirical analysis of TMB in over 100 tumour types indicated a median value of 2.7 mut / Mb ( 30 ).This would translate as 205 somatic mutations on the simulation target used in this study.Together with 42 artefacts incorrectly annotated as PASS, this would imply a false positi v e rate of 20%.As expected, the mutational profile detected by Mutect2 resembles an FFPE and 8-o x oG DNA damage signature (Supplementary Figure S3).Interestingly, the total median depth at true negati v e loci (where the DNA damage artefact was correctly filtered by Mutect2) was 95, as opposed to 50 at false positi v e loci (where the damage was incorrectly passed).The median number of reads supporting the alternati v e allele recorded by the caller was 5 at true negati v e loci and 3 at false positi v e loci, with median alternati v e allele frequencies of 0.054 and 0.084, respecti v ely.In the case of true negati v es, 4043 out of a total of 4169 true negati v es contained evidence supporting the candidate somatic allele on one genomic strand only.The r emaining r ecords contained evidence from both genomic strands with the evidence from one strand caused by sequence error.In the case of false positi v es, howe v er, only 30 out of 42 recorded evidence of the alternate allele from one genomic strand only, suggesting interaction between false positi v es arising from DNA damage and the occurrence of the same substitution due to sequencing error on the opposite strand ( P = 5e −09, from Fisher's exact test).In effect, in the case of 12 false positi v es, a sequence error enabled the DNA damage artefact to escape the orientation bias filter as evidence of the alternati v e allele was present on both genomic strands.To explore the scenario in w hich histolo gicall y normal tissue adjacent to the FFPE tumour sample is used as a control, we spiked in the same le v el of FFPE and 8-o x oG burden to the normal sample and re-ran the somatic variant calling pipeline.This simulation yielded similar results (34 false positi v es with 12 showing evidence of the alternate allele on both strands).

DISCUSSION
We have developed a computational framework for simulating personalized, phased, cancer genome sequencing data tha t crea tes a soma tic SBS cancer distribution in a base BAM file containing all germline indel and SNV variation from a 1000 Genomes donor.Cancer indel and structural variants are not yet simulated by this frame wor k.Our frame wor k provides a comprehensi v e report on the sources of all non-reference sites in the simulated data and accounts for the randomness in the number of reads that contain the non-r efer ence allele at somatic mutation sites.We have applied this framework to assess the performance of a widely used pipeline to call somatic mutations.In agreement with previous reports ( 3 , 9 , 31 ), our initial analysis indicated that the GATK4 Mutect2 pipeline had a very low rate of false positi v e mutation calls.Howe v er, pre-analytical factors, particularly those associated with sample storage and preparation, can significantly im-pact downstream somatic variant analysis.Artefacts introduced in these stages are o verlook ed by current bioinforma tics simula tion methods (Table 1 ) and not accounted for in their assessment of caller specificity.To illustrate this, we tested the GATK orientation bias filter against simulated FFPE and 8-o x oG damaged sequencing data and demonstrated a mechanism by which orientation bias artefacts escape GATK filtering leading to additional caller false positi v es.We also quantified the number of false negati v es, corresponding to true somatic variants that wer e incorr ectly filtered by Mutect2, and investigated biases in allele frequency estimation.
Misestimation of allele frequency may be of particular scientific and clinical relevance.We have pr eviously r eported a bias in the inferred mutation frequency when onl y m utations with an observed frequency greater than a thr eshold ar e consider ed ( 27 ).Despite the absence of an explicit threshold, simulations in this paper re v eal a similar allele frequency bias resulting from the dependence of mutation detection probability on the number of reads that support the mutant allele (Figure 4 ).The mutation frequencies at which this bias is observed decrease with increasing depth of coverage (Figure 4 ).The novel simulation framewor k enab led us to inv estigate such biases in realistic simula ted da ta using a commonl y a pplied somatic m utation calling pipeline.It also allowed us, for the first time, to determine the observed frequency spectrum that results when mutations from a theoretical spectrum corresponding to a model of tumour evolution are called from cancer sequencing data.
We found that a substantial number of true somatic variants were excluded by the caller as a consequence of being incorrectly identified as an artefact common to both tumour and normal samples.Mutect2 filters candidate variants based on minimal evidence of their presence in the normal sample ( normal artifact ), e v en when the variant is present at much greater frequency in the cancer sample.This means, along with filtering a number of (primarily germline) true negati v es, true somatic variants can also be remov ed due , for example , to sequencing errors in the normal sample.Our sim ulations, w hich used the same depth of coverage in both tumour and normal samples, demonstrated that this can be a significant issue, particularly at high depths.For example, one variant was incorrectly flagged as normal artifact based on its detection by Mutect2 at an allele frequency of 0.00054 in the 600 × normal sample.The allele in the normal sample was, in fact, a sequencing error.The median allele frequencies in the normal sample for which normal artifact false negati v es were excluded at depths 100 ×, 200 ×, 350 × and 600 × were 0.0064, 0.0039, 0.0022 and 0.0016, respecti v ely.In practice, the normal sample is often sequenced to a lower depth than the cancer sample.This would reduce the number of mutations that are lost in this way.Howe v er, some assays r equir e the same depth of coverage in the normal sample (e.g.copy number analysis), while another publication recommends as high a depth of coverage as possible in both tumour and normal samples ( 32 ).We recommend manual curation of variants filtered solel y as normal artifact , particularl y w here they may be of clinical relevance.

CONCLUSION
High-confidence identification of somatic mutations in tumour samples and accurate inference of their frequencies are important for clinical decision-making and in cancer r esear ch.Realistic simulations continue to play a key role in this regard, improving our understanding of the performance of computational pipelines that have been designed to identify somatic mutations.The extremely low false positi v e rates achie v ed by somatic variant callers such as Mu-tect2 ( 3 , 9 , 31 ) are in part enabled by an extensive set of filtering steps designed to remov e artefacts.Howe v er, we hav e demonstra ted tha t strict thr esholds enfor ced by some of these filters come at a price, in terms of power, with some true mutations being flagged by the filters.We have highlighted limitations in existing methods of assessing the false positi v e rates of mutation callers and demonstrated a mechanism through which DNA damage introduced during the pre-analytical phase of the sequencing process can lead to false somatic mutation calls.We have also quantified the extent of the bias in the estimated frequencies of the soma tic muta tions tha t are identified, as a function of sequencing depth, and determined the empirical mutant frequency spectrum corresponding to the neutral model of tumour evolution.Our simulations also allow us to predict caller detection rate as a function of allele frequency.This novel simulation tool can be applied to evaluate the accuracy with which individual mutations or mutation burdens are calculated and to compare the observed frequencies of somatic mutations to their expected distribution under competing models of tumour evolution.

DA T A A V AILABILITY
The software and results relating to this publication are available at Zenodo with doi:10.5281/ zenodo.8155004.The stochastic simulation frame wor k is also availab le from https://github.com/BrianOSullivanGit/stochasticSim .

SUPPLEMENT ARY DA T A
Supplementary Data are available at NAR Cancer Online.

Figure 1 .
Figure 1.Tumour stochastic HTS simulation frame wor k.Personalized phased donor genome incorporates all single-nucleotide polymorphisms (SNPs) and indels recorded from any 1000 Genomes donor.

Figure 2 .
Figure 2. ( A ) The 'ground truth' or true frequency spectrum of soma tic muta tions in our simulation of a diploid tumour deri v ed from a neutral evolutionary model.The total true burden was 2681 somatic variants.( B ) Number of true somatic variants incorr ectly filter ed by Mutect2 as artefacts, stacked by filter type, from neutral model simula tions a t each of the four sequencing depths, 100 ×, 200 ×, 350 × and 600 ×. ( C ) Probability that a true soma tic muta tion is passed by Mutect2 as a function of its true allele frequency for 100 × sequencing data with 100% tumour purity.This simulation was also repeated over a range of depths on a reduced target size (Supplementary Figure S1).( D ) Number of true somatic variants incorrectly filtered as artefacts in the uniform frequency simulations, stacked by filter type.Each bar r epr esents the fractions of false negati v es incorrectly e x cluded b y the caller from a total somatic burden of 40 000 variants.

Figure 3 .
Figure 3. VAF plots from Mutect2 output of a diploid tumour deri v ed from a neutral evolutionary model, overlaid on the ground truth.Ground truth burden is faded where it starts to extend beyond the y -axis.Depth of coverage is as indicated in each plot.

Figure 4 .
Figure 4. ( A ) The VAF distribution as inferred by Mutect2 from simulated data consisting of 10 000 somatic mutations, each with a true allele frequency of 0.035.The blue arrow indicates the true allele frequency at which the somatic burden is located (the ground truth).Each of the overlay plots indicates what is inferred by Mutect2 at the sequencing depths indicated.The data have been processed using the smooth.splinefunction from the base R stats package( 26 ).The same data are also available without smoothing (Supplementary FigureS4).( B ) Explanation of the somatic variant low-frequency caller bias, as annotated by Mutect2 for the 100 × data from the previous panel.

Table 1 .
Comparison of the functionality of tumour simulation methods